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Abstract. The Noah's Ark Problem (NAP) is an NP-Hard optimiza- 
tion problem with relevance to ecological conservation management. It 
asks to maximize the phylogenetic diversity (PD) of a set of taxa given a 
fixed budget, where each taxon is associated with a cost of conservation 
and a probability of extinction. NAP has received renewed interest with 
the rise in availability of genetic sequence data, allowing PD to be used 
as a practical measure of biodiversity. However, only simplified instances 
of the problem, where one or more parameters are fixed as constants, 
have as of yet been addressed in the literature. We present NAPX, the 
first algorithm for the general version of NAP that returns a 1 — e ap- 



time where n is the number of species, and B is the total budget and h 
is the height of the input tree. We also provide improved bounds for its 
expected running time. 
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1 Introduction 
1.1 Motivation 

Measures of biodiversity are commonly used as indicators of environmental 
health. Biodiversity is presently being lost at an alarming rate, due largely to hu- 
man activity. It is speculated that this loss can lead to disastrous consequences 
if left unchecked [9]. Consequently, the discipline of conservation biology has 
arisen and a considerable amount of resources are being allocated to research 
and implement conservation projects around the world. 

A conservation strategy will necessarily depend on the measure of biodiversity 
used. Traditionally, indices based on species richness and abundance have been 
used to quantify the biodiversity of an ecosystem [8] . These indices are based on 
counting and do not account for genetic variance. Phylogenetic diversity (PD) 
[3] addresses this issue by taking into account evolutionary relationships de- 
rived from DNA or protein samples. The use of PD in biological conservation 



proximation of the optimal solution. It runs in O 




has become increasingly widespread as more phylogenetic information becomes 
available [7] . It is also used to determine diverse sequence samples in comparative 
genomics [10]. 

The Noah's Ark Problem (NAP) [13] is an abstraction of the fundamental 
problem of many conservation projects: how best to allocate a limited amount of 
resources to maximally conserve phylogenetic diversity. This is in turn a general- 
ization of the Knapsack Problem [4] and is therefore NP-Hard. Several algorithms 
have been proposed to solve special cases of the problem but, as yet, no non- 
heuristic solutions have been proposed to solve general instances of NAP. Given 
that NAP itself is an abstraction of realistic scenarios, it is important to have a 
general solution in order to be able to extend this framework to useful applica- 
tions. For this reason, we present an algorithm that can be used to compute an 
approximate solution for NAP in polynomial time, so long as the approximation 
factor is held constant, and total budget is polynomial in the input size. 

1.2 Definitions 

Throughout this paper, we use the following definition of a phylogenetic tree 
T, with notation consistent with that of [6]. T has a root of degree 2, interior 
vertices of degree 3, and n leaves, each associated with a species from set X. 
If an edge e of T is incident to a leaf, it is called a pendant edge. Otherwise 
e has exactly two adjacent edges, I and r, below it (not on the path from e to 
the root) and these are referred to as e's children. A is a function that assigns a 
non-negative branch length to each edge in T. The phylogenetic diversity of T, 
PD(T) is defined as 



where the summation is over each edge e of the tree. Intuitively, this measure 
corresponds to the amount of evolutionary history represented by T. 

The Noah's Ark Problem has the objective of maximizing the expected PD, 
K(PD), under the following constraints. Each taxon i £ X is associated with an 
initial survival probability dj, which can be increased to hi at some integer cost 
Ci] and the total expenditure cannot exceed the budget B. Since B is a factor in 
the running time, we assume that the budget and each cost have been divided 
by the greatest common divisor of all the costs. In the original formulation of 
NAP, each species was also associated with a utility value. However, in [6] it 
was shown that these values arc redundant as they can be incorporated into 
the branch lengths without altering the problem. To avoid accounting for de- 
generately small probability values, we make the assumption that the conserved 
survival probabilities are not exponentially small in n. In other words, there ex- 
ists a constant k such that bi > rT k for each i £ X. We feel this assumption is 
reasonable as it is unrealistic that money would be allocated to obtain such a 
negligible probability of survival. 

If a species survives, the information represented by its path to the root is 
conserved. Consequently, the probability that an edge survives is equivalent to 
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the probability that at least one leaf below it in T survives. Let C e be the set 
of leaves below e in the tree and S be the set of species selected for protection. 
E(PD\S), can be derived from (1) as follows: 

E(PD\s)=j2m[i- n a-*) n ( 2 ) 

e \ iecns jec e -s J 

where the summation is over all edges. NAP asks to maximize E(PD\S) subject 
to 

Our algorithm is based on decomposing T into clades which are associated with 
the edges of the tree. A clade corresponding to edge e, denoted JC ei is the minimal 
subtree of T containing e and C e , the set of leaves below it. The E(PD) of K. e 
can be computed as in (2) but summing only over edges in the clade. The entire 
tree can be considered a clade by attaching an edge of length to its root. If e 
has two descendant edges I and r, then we say JC e has two child clades JCi and 
K, r . 



1.3 Related Work 

Let en ^> bi NAP refer to the problem as described above, where the survival 
probabilities and cost of each taxon are input variables. Fixing one or more of 
these variables as constants produces a hierarchy of increasingly simpler sub- 
problems [11]. The simplest, 0^1 NAP, is equivalent to finding the set of B 
leaves whose induced subtree (including the root) has maximum PD and can be 
solved by a greedy algorithm [12] [10]. 1 NAP on ultrametric (all leaves 

equidistant from the root) trees and (1 — Xi) — > (1 — nxi) for general trees where 
Xi is a variable probability and n is a constant factor such that < k < 1 can 
likewise be solved in polynomial time by greedy algorithms [6] . Given that 1 
NAP is itself a generalization of the Knapsack problem which is NP-Hard, it is 
extremely unlikely that an exact, polynomial-time solution for this kind of NAP 
or any generalizations will ever be found. Pardi and Goldman [11] did find a 
pseudopolynomial-time dynamic programming algorithm for the 1 NAP 
on general (non-ultrametric) trees that makes the realistic assumption that B 
is polynomial in n. They also show that any instance of a, L 1 NAP can be 
transformed to an instance of 1 NAP, allowing their algorithm to solve 
such instances as well. 

This algorithm relies upon the observation that the solution to 1 NAP 
for any clade can be obtained from the solutions to its two child clades [11]. 
Which solutions to use depends on how the budget is allocated to the two sub- 
problems. If the budget at IC e is 6, then there are 6+1 ways to split it across ICi 
and IC r . By solving these b+ 1 pairs of subproblems, the optimal solution can be 
found in the pair with maximum total E(PD) (plus the expected contribution 
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of e). Recursively proceeding in this fashion from the root down would not yield 
an efficient algorithm as the number of possible budget divisions increases expo- 
nentially with each level of the tree. Instead, the clades are processed bottom-up 
from the leaves. All b + 1 scores are computed and stored in a dynamic program- 
ming table for each clade. Each score can be determined by taking the maximum 
of b + 1 possible scores of its child clades, which are already computed or com- 
puted directly from (2) if the clade contains a single leaf. Each table entry can 
therefore be computed in O(B) time. There are 0{B) entries per clade and 0(n) 
clades in the tree giving a total running time of 0{nB 2 ). 

This procedure does not work for dj b,- t NAP because this version of the 
problem docs not display the same optimal substructure [11]. In 1, the 
dynamic programming algorithm implicitly maximizes the survival probability 
of the clade in addition to its E(PD) value. The total score of the tree is a 
function of both of these values which is why the algorithm works for this case. 
In cii -A bi NAP, a budget assignment that maximizes survival probability of the 
clade does not guarantee that it will have maximal E(PD) and vice versa. The 
correct allocation cannot be made without knowledge of the entire tree; hence, 
the optimal substructure exploited by [11] for 1 NAP is not present. As 
an example, consider the instance of NAP in Figure 1 with B = 3. The optimal 
solution is to conserve w and y for E(PD) = 225. However, locally computing 
the best allocation of budget 1 for the clade containing y and z will select z for 
conservation, and any chance of obtaining the optimal solution will be lost. In 
this case, it is more important to maximize the survival probability of the clade 
rather than E(PD), but there is no way for an algorithm to be aware of this 
without globally solving the entire tree. 




Fig. 1. An example why the dynamic programming algorithm of [11] does not 
work for general instances of NAP. The optimal allocation for the clade contain- 
ing y and z is not part of a globally optimal solution. 
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2 NAPX Algorithm 
2.1 Description 



In this section we present NAPX, an 




dynamic program- 



ming algorithm for -A &, NAP that produces a (1 — e)-approximation of the 
optimal solution, where h denotes the height of T. As that in [11], our algorithm 
is only polynomial if B is polynomial in n. This assumption is justifiable if, for 
example, B is expressed in millions of dollars and its value will be a reason- 
ably small integer. Without loss of generality, we also assume that no single cost 
exceeds the budget. 

NAPX essentially generalizes the dynamic programming table of [11] by com- 
puting for each clade, each desired survival probability of the clade, and any 
budget between 1 and B, the maximum E(PD) score achievable while guaran- 
teeing this survival probability. This way, we need not make the choice between 
maximizing E(PD) or probability as the tables are constructed. From the defi- 
nition of K(PD) in (2), the probability of survival of an edge can be written as 
a function of its two children. Let P e denote the survival probability of edge e, 
and I and r be e's children. Then 

P e = Pi+P r - PiP r . (3) 

In the optimal solution for NAP on T, assume b dollars are assigned to clade /C e 
and e survives with probability p. It follows that i and b — i dollars are assigned 
to Ki and JC r respectively where < i < b. These subclades must survive with 
probabilities j and y^-j (or when p = j = 1), for some < j < p, in order to 
satisfy (3). Because the probability is continuous, we discretize it into intervals 
by rounding it down to the nearest multiple of a chosen constant a. Probabilities 
less than a chosen cutoff value p m in are rounded to zero. 

pe [0,a riog - p -"V..,a 2 ,a,l} 

If two non-zero probabilities lie in the same interval, their ratio is at most a. 
If they arc in consecutive intervals, their ratio is likewise bounded by a 2 . For 
notational convenience, we define a mapping ir(-) that rounds a probability to 
the lower bound of its corresponding interval. 




if p < p min 

a R°g p] otherwise . 



We now formally describe our algorithm. For each edge e, we construct a two- 
dimensional table T e where T e (b,p) stores the optimal expected diversity of JC e 
given that b dollars are assigned to it and it survives with a probability that lies 
no less than p. The table is constructed in the following manner if e is a pendant 
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edge incident to the leaf for species s. 
T e (b,p) 



a s X(e) if b < c s and p = 7r(a s ), 
6 s A(e) if b > c s and p = n(b s ), or (4) 
— oo otherwise. 



Otherwise, T e is computed from the tables of its two children, 7] and T r . 

T e (b,p)=pX(e)+ma X {T l {i,j)+T r (b-i,k)} (5) 

subject to 

^e {0,1, 2,. ..,6}, 
j,ke {0,^^ p™«» 1,...,a 2 ,a,l}, 
tt(j + k- jk)=p 

The E(PD) score for the entire tree can be obtained by attaching an edge e r 
of length to the root and finding ma,Xj{T er (B, j)}. The tables are computed 
from the bottom up, and each time an entry is filled, pointers are kept to the two 
entries in the child tables from which it was computed. This way the optimal 
budget allocation can be obtained by following the pointers down from the entry 
for the optimal score for e r . 



2.2 Approximation Ratio 

In this section, we express the worst-case approximation ratio as a function of 
the constants p m in and a introduced above, beginning with p m in- Note that 
since any species s with c s > B can be transformed into a new species s' with 
c s i = 0, b s i = a s and a s > — a s without affecting the outcome, we can safely 
assume that c s < B for all s £ X. 

Lemma 1. Let I be an instance of NAP for which there exists a constant k such 
that bi > n~ k > p m in for all i G S ■ Consider a transformed instance I' where 
all a,i values in the range (0,p m i n ) are rounded to 0. Let OPT (I) and OPT (I') 
be the expected PD scores of the optimal solutions to I and I' respectively. Then 
the ratio of these scores is bounded as follows: 

OPT(L') > (1 - n k+1 p mm )OPT(L) 

Proof. Let path(s) be the set of edges comprising the path from leaf s to the 
root. We define w(s) as the expected diversity of the path from s to the root if 
s is conserved: 

w(s) =b s ^2 •%)• 

eGpath(s) 

Let w max = max s gx {w(s)}. This value allows us to place a trivial lower bound 
on the optimal solution (recalling that we can assume that c s < B). 

w max < OPT(I). (6) 
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We also observe that if any species s survives with a non-zero probability smaller 
than pmin in the optimal solution, its contribution to OPT(7) will be bounded 

by P mmW ( s ^ i n follows that 



OPT(I') > OPT(I) - 



p min w{s) 



b s 

sex s 

Since b s > n~ k and w(s) < w max , we can express the bound as 
OPT(I') > OPT {I) - n Pmm ^™ a * , 

Dividing by OPT(I) yields 

OPT(I') _ n k+1 p min w max 
OPT(I) ~ OPT(I) 

From (6) we obtain 

° PT{I,) >l-n^ P ■ 

OPT(I) Pmin, 

which completes the proof. □ 



The size of the probability intervals in the tables, determined by a, also affects 
the approximation ratio. This relationship is detailed in the following lemma. 

Lemma 2. Let OPT e (b,p) denote the optimal expected PD score for clade K e 
if e survives with probability exactly p and b dollars are allocated to it. Now 
consider an instance of NAP such that all a s and b s are either or at least 
Pmin- For any OPT e (b,p) where e is at height h in the tree, there exists a table 
entry T e (b 7 p') constructed by NAPX such that the following conditions hold: 

i )T e (b,p')>a h OPT e (b,p) 
ii) p' > a h p 

Proof. If p = 0, then OPT e (b,p) — and the lemma holds. For the remainder 
of the proof, we assume p > p m in- The proof will proceed by induction on h, 
the height of e in the tree, beginning with the base case where h = 1 and e is a 
pendant connected to leaf s. We need only consider the cases where the optimal 
solution is defined. So without loss of generality, assume we have OPT e (b, a s ) = 
X(e)a s . From (4), we know there is an entry T e (b 7 Tr(a s )) = a s X(e) and therefore 
both i) and ii) hold. 

We now assume that the lemma holds for h < x and consider some edge e at 
height x + 1. By definition, OPT e (b,p) can be expressed in terms of its children 
/ and r. 

OPT e (b,p) = p\(e) + OPT^j) + OPT r (b - i, k) 
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where j + k — jk = p. From the induction hypothesis, there exist 



Ti(i,j')>a x OPTi{i,j) and 
T r (b - i, k') > a x OPT r (b - i, k) 

where j' > a x j and k' > a x k. Let q = j' + k' — j'k'. It follows that 

q > a x j + a x k- a 2x jk > a x p. 



(7) 



The left inequality in (7) holds because f + k' — j'k' increases as j' or k! increase, 
so long as their values do not exceed 1. This can be checked by observing that 
the partial derivatives with respect to j' and k' are 1 — k' and 1 — j' , respectively. 

Ti{i,j') and T r (b—i, k') will be considered when computing the entry T e (b,p') 
where p' — ir(q). Since q > p m i ni we have ir(q) > aq because it simply rounds q 
to the nearest multiple of a. Therefore, p' > a x+1 p and T e (b,p') can be expressed 
as follows. 

T e (b,p') > p'A(e) + T,(i, j') + T r (b - i, k') 

> a x+1 pX(e) + a x OPTi(i,j) + a x OPT(b - i, k) 

> a x+1 (p\(e) + OPTi + OPT(b -i,k) 
>a x+1 OPT e (b,p) 

□ 

Combining Lemmas 1 and 2 allows us to state that NAPX returns a solution 
that is at least a factor of (1 — n k+1 p min )a h of the optimal solution. In this 
section we show that these results also imply that a (1 — e) approximation can 
be obtained in polynomial time for an arbitrary constant e. 



Lemma 3. O 




probability intervals are required in the table 



|log(l-e)| 
in order to obtain a 1 — e approximation. 

Proof. The number of probability intervals, t, required for the table is bounded 
by the number of times 1 must be multiplied by a to reach p m i n . Hence a* < p m in 
and 

logp„ 



t 



log a 



(8) 



From Lemmas 1 and 2 we can obtain the desired approximation ratio by selecting 
a = \J (1 — e)i and p m i n — 1 ~/+r e . Plugging these values into (8) gives 



t = 



log 



log (^J(l-e)i 



2fe(log(l - yT^e) - (k + 1) logn) 
log(l - e) 
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It can be shown that log(l — \/l — e) is O(loge), so multiplying by — {- we can 
express t asymptotically as 

. Q ' fe(logn-loge) \ =Q ( h (logn + log \) \ 



■log(l-e) ) \ |log(l-e)| J 

a 

Theorem 1. NAPX is a (1 — e) -approximation with time complexity 

' nB 2 h 2 (logn + logi) 2 \ 



O 



log 2 (l - e) 



Proof. For each table entry T(b,p), we must find the maximum of all possible 
combinations of entries in the left and right child tables that satisfy b and p. 
These combinations correspond to the possible {i,j, k} triples from (5). There 
are 0(Bt 2 ) such combinations as i corresponds to the budget and j and k cor- 
respond to probability intervals. Furthermore, for fixed values of p and j, there 
are potentially 0{t) different values of k that could satisfy n(j + k — jk) due to 
rounding. It follows that a naive algorithm would have to compare all 0(Bt 2 ) 
combinations when computing the maximum in (5) for each table entry. 

Fortunately, because Tr(j+k—jk) is monotonically nondecreasing with respect 
to either j or k, we can directly compute for any fixed p and j the interval of k 
entries that satisfy n(j + k — jk) = p: 



log Q 



oep — ] 
1-J 



!og Q 



P- J 
1-3 



Finding the value of k in the interval such that T(b — i, k) is maximized is 
effectively a range maxima query (RMQ) on an array. Regardless of the size of 
the interval, such a query can be performed in constant time if instead of an array, 
the values are stored in a RMQ structure as described in [1]. Such structures 
are linear both in space and the time they take to create, meaning that we can 
use them to store each column in the table (corresponding to budget value i) 
without adversely affecting the complexity. Now, when given a pair {i,j}, the 
optimal value of k can be computed in constant time, bringing the complexity 
of filling a single table entry to O(Bt), the number of combinations of the pair 

There are 0(Bt) entries in each table, and a table for each of the 0(n) edges 
in the tree. The space complexity is therefore 0(nBt) and the time complexity is 
0(nB 2 t 2 ). Substituting t for the value that yields a (1 — e) approximation ratio 

' nB 2 h 2 (logn + logi) 2 \ 
, log 2 (l-e) r 



shown in Lemma 3 gives O 



2.3 Expected Running Time 

Since in general the height of a phylogenetic tree with n leaves is 0(n), the 
running time derived above is technically cubic in n. Fortunately, for most inputs 
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we can expect the height to be much smaller. In this section, we will provide 
improved running times for trees generated by the two principal random models. 
Additionally we will show that caterpillar trees, which should be the pathological 
worst-case topology according to the above analysis, actually have a much lower 
complexity. 

The Yule-Harding model [14] [5] , also known as the equal-rates-Markov model, 
assumes that trees are formed by a succession of random speciation events. The 
expected height of trees formed in this way, regardless of the speciation rate, is 



A caterpillar tree is a tree where all internal nodes are on a path beginning 
at the root, and therefore has height n. This implies that every internal edge 
has at least one child edge that is incident to a leaf. Suppose edge e has child I 
that is incident to the leaf for species s. This table only contains two meaningful 
values: T;(0, o s ) and Ti(c s ,b s ). Therefore to compute entry T e (b,p), only O(l) 
combinations of child table entries need to be compared and the time complexity 



3 Conclusion 

NAPX is, to our best knowledge, the first polynomial-time algorithm for a, &j 
NAP that places guarantees on the approximation ratio. While there are still 
some limitations, especially for large budgets or tree heights, our algorithm still 
significantly increases the number of instances of NAP that can be solved. More- 
over, our expected running time analysis shows that the algorithm will usually be 
much more efficient than its worst-case complexity suggests. This work towards 
a more general solution is important if the Noah's Ark Problem framework is 
to be used for real conservation projects. Some interesting questions do remain, 
however. Does NAP remain NP-Hard when the budget is constrained to be poly- 
nomial in n? We conjecture that it is, but the usual reduction from Knapsack is 
clearly no longer valid. We would also like to find an efficient algorithm whose 
complexity is independent of h and/or B. 
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